Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core][collective] Support customizing gloo timeout #50223

Merged
merged 3 commits into from
Mar 1, 2025

Conversation

HollowMan6
Copy link
Contributor

Why are these changes needed?

Should be used together with ray-project/pygloo#34 so that the parameters can be properly passed to gloo.

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jcotant1 jcotant1 added the core Issues that should be addressed in Ray Core label Feb 4, 2025
Copy link
Contributor

@jovany-wang jovany-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG

@jovany-wang
Copy link
Contributor

@HollowMan6 BTW, what's your usecase?

@HollowMan6
Copy link
Contributor Author

@jovany-wang Thanks for reviewing! It's for resolving the timeout issue when sending large tensors via gloo with ray collective communication lib.

@jovany-wang
Copy link
Contributor

@HollowMan6 Are you going to apply this feature into OpenRLHF or something like it?

@jovany-wang
Copy link
Contributor

Oops, maybe ask for an unit test.

@HollowMan6
Copy link
Contributor Author

Are you going to apply this feature into OpenRLHF or something like it?

Yes and probably sometime after this release. But actually I think this is widely applicable whenever users think that 30 seconds of default gloo timeout is not enough for them.

Oops, maybe ask for an unit test.

I added it to test_gloo_group_isolation.py and fixed some errors in the test file. By the way, I also make get_group_handle public, as it could be useful for users to access the group handle and more fine-grained control.

@jovany-wang

@HollowMan6 HollowMan6 force-pushed the gloo-timeout branch 4 times, most recently from 8f6f810 to 89f5920 Compare February 28, 2025 13:53
The default gloo timeout is 30 seconds, which might not be
enough for some use cases.

Should be used together with ray-project/pygloo#34
so that the parameters can be properly passed to gloo.

Signed-off-by: Hollow Man <hollowman@opensuse.org>
So that users can get access to the specific Group
and have more control over the context

Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
@jovany-wang jovany-wang enabled auto-merge (squash) March 1, 2025 06:22
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Mar 1, 2025
@jovany-wang jovany-wang merged commit a04cb06 into ray-project:master Mar 1, 2025
7 checks passed
@HollowMan6 HollowMan6 deleted the gloo-timeout branch March 1, 2025 08:56
VamshikShetty pushed a commit to VamshikShetty/ray that referenced this pull request Mar 3, 2025
## Why are these changes needed?

Should be used together with
ray-project/pygloo#34 so that the parameters can
be properly passed to gloo.

<!-- Please give a short summary of the change and the problem this
solves. -->

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [X] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [X] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: vs030455 <vamshikdshetty@gmail.com>
Michaelhess17 pushed a commit to Michaelhess17/ray that referenced this pull request Mar 3, 2025
## Why are these changes needed?

Should be used together with
ray-project/pygloo#34 so that the parameters can
be properly passed to gloo.

<!-- Please give a short summary of the change and the problem this
solves. -->

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [X] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [X] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Hollow Man <hollowman@opensuse.org>
xsuler pushed a commit to antgroup/ant-ray that referenced this pull request Mar 4, 2025
## Why are these changes needed?

Should be used together with
ray-project/pygloo#34 so that the parameters can
be properly passed to gloo.

<!-- Please give a short summary of the change and the problem this
solves. -->

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [X] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [X] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Hollow Man <hollowman@opensuse.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants